Search for: All records

Creators/Authors contains: "Worden, E"


1. Human-conducted rating tasks are resource-intensive, demanding significant time and money. As Large Language Models (LLMs) such as GPT demonstrate strong performance across many domains, their potential for automating such evaluation tasks becomes evident. In this research, we used four prominent LLMs (GPT-4, GPT-3.5, Vicuna, and PaLM 2) to evaluate teacher-authored mathematical explanations against a detailed rubric covering accuracy, clarity of explanation, correctness of mathematical notation, and efficacy of the problem-solving strategy. During this investigation, we unexpectedly found that HTML formatting influenced the evaluations: GPT-4 consistently favored explanations formatted with HTML, whereas the other models showed mixed preferences. When measuring Inter-Rater Reliability (IRR) among the models, only Vicuna and PaLM 2 showed high IRR under the conventional Cohen's Kappa metric for HTML-formatted explanations; under a more relaxed version of the metric, all model pairings showed strong agreement. These findings underscore the potential of LLMs to provide feedback on student-generated content and point to new avenues, such as reinforcement learning, that could harness consistent feedback from these models.
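
To make the IRR comparison in item 1 concrete, here is a minimal Python sketch assuming two models assign ordinal rubric scores (say, 1 to 4) to the same set of explanations. The scores below are invented for illustration, and the "relaxed" metric is shown here as a linearly weighted kappa that gives partial credit for near-misses; whether that matches the paper's exact relaxation is an assumption.

```python
# Sketch of comparing strict vs. relaxed inter-rater reliability between
# two LLM raters. Scores are hypothetical ordinal rubric ratings (1-4).
from sklearn.metrics import cohen_kappa_score

vicuna_scores = [4, 3, 2, 4, 1, 3, 4, 2]  # hypothetical ratings, model A
palm2_scores  = [4, 3, 3, 4, 2, 3, 4, 2]  # hypothetical ratings, model B

# Conventional Cohen's kappa: only exact agreement counts.
strict_kappa = cohen_kappa_score(vicuna_scores, palm2_scores)

# One common relaxation for ordinal scales: linearly weighted kappa,
# which penalizes a 3-vs-4 disagreement less than a 1-vs-4 disagreement.
relaxed_kappa = cohen_kappa_score(vicuna_scores, palm2_scores, weights="linear")

print(f"Conventional Cohen's kappa: {strict_kappa:.2f}")
print(f"Weighted (relaxed) kappa:   {relaxed_kappa:.2f}")
```

With ratings like these, the weighted kappa typically comes out higher than the strict one, mirroring how all model pairings in the study showed strong agreement only under the relaxed metric.
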
2. We present a conversational AI tutor (CAIT) to aid students with middle school math problems. CAIT was created using the CLASS framework: it is an LLM fine-tuned from Vicuna on a conversational dataset created by prompting ChatGPT with problems and explanations from ASSISTments. CAIT is trained to generate scaffolding questions, provide hints, and correct mistakes on math problems. We find that CAIT identifies 60% of correct answers as correct, generates effective sub-problems 33% of the time, and exhibits positive sentiment in 72% of interactions, with the remaining 28% being neutral. This paper discusses the hurdles to further integration of CAIT into ASSISTments, chiefly improving accuracy and the efficacy of sub-problems, and establishes CAIT as a proof of concept that the CLASS framework can be applied to create an effective mathematics tutorbot.
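
As a rough illustration of how the rates reported in item 2 could be tallied, here is a minimal Python sketch assuming each tutor interaction has been hand-labeled for whether the correct answer was identified, whether the generated sub-problem was effective, and its sentiment. All records and field names are hypothetical; the paper's actual annotation scheme is not specified here.

```python
# Sketch of computing per-interaction evaluation rates for a tutorbot,
# given hand-labeled interaction records (all data here is made up).
interactions = [
    {"answer_judged_correctly": True,  "subproblem_effective": False, "sentiment": "positive"},
    {"answer_judged_correctly": True,  "subproblem_effective": True,  "sentiment": "positive"},
    {"answer_judged_correctly": False, "subproblem_effective": False, "sentiment": "neutral"},
]

def rate(predicate):
    """Fraction of interactions for which the predicate holds."""
    return sum(predicate(i) for i in interactions) / len(interactions)

print(f"Correct answers identified: {rate(lambda i: i['answer_judged_correctly']):.0%}")
print(f"Effective sub-problems:     {rate(lambda i: i['subproblem_effective']):.0%}")
print(f"Positive sentiment:         {rate(lambda i: i['sentiment'] == 'positive'):.0%}")
```

Run over a full labeled set of tutoring sessions, tallies of this form would yield the kind of figures the abstract reports (60% correct-answer identification, 33% effective sub-problems, 72% positive sentiment).
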